Add to_batches() and interpolate() methods to DataFrame #1241
Conversation
- Add to_batches() as alias for collect() returning RecordBatch list
- Add interpolate() method with forward_fill support
- Add deprecation warning to collect() method
- Add comprehensive tests for both methods

Addresses items from RFC apache#875
def to_batches(self) -> list[pa.RecordBatch]:
    """Convert DataFrame to list of RecordBatches."""
    return self.collect()  # delegate to existing method
My opinion (see #1227) is to limit the surface area where we explicitly depend on pyarrow. Especially in this case, where it's just an alias?
100% agree, this should return arro3 RecordBatches.
I'm not saying it should even return arro3 RecordBatches... because that's still an external dependency that datafusion would be imposing on users. Datafusion could return a minimal batch object that just holds the RecordBatch pointer, and the user could then transfer it to their library of choice.
> I'm not saying it should even return arro3 RecordBatches... because that's still an external dependency that datafusion would be imposing on users. Datafusion could return a minimal batch object that just holds the RecordBatch pointer, and the user could then transfer it to their library of choice.

That's possible, but I would argue it's more convenient if it's already usable in some way rather than just a pointer. Arro3 is small enough that its size is negligible overall.
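For illustration, a minimal sketch of the "pointer-only" batch object idea, assuming the batch can be exported through the Arrow PyCapsule interface (`__arrow_c_array__`); `MinimalBatch` and `native_batch` are hypothetical names, not part of this PR:

```python
class MinimalBatch:
    """Hypothetical wrapper holding only an exported Arrow batch.

    By implementing the Arrow PyCapsule protocol, any Arrow-aware
    library (pyarrow, arro3, polars, ...) can import the data
    zero-copy, without datafusion depending on any of them.
    """

    def __init__(self, native_batch):
        # `native_batch` stands in for whatever Rust-side object
        # already exposes the PyCapsule protocol (an assumption here).
        self._native = native_batch

    def __arrow_c_array__(self, requested_schema=None):
        # Return the (schema_capsule, array_capsule) pair the
        # protocol requires, delegating to the native object.
        return self._native.__arrow_c_array__(requested_schema)


# The user then transfers it to their library of choice, e.g.:
#   import pyarrow as pa
#   batch = pa.record_batch(MinimalBatch(native_batch))  # recent pyarrow
```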
Perhaps we should get @timsaucer's thoughts in #1227: is having a required dependency on pyarrow a problem? What should we do about it? Do we want to depend on arro3-core instead? Or have functions that rely on pyarrow but error if pyarrow isn't installed?
@ion-elgreco feel free to write down your thoughts there too
I understand that pyarrow is a large dependency, but I also have a feeling from the community that nearly everyone who is using datafusion-python to do things like to_batches is using pyarrow in the next portions of their code. I'm sure we can always find exceptions.
Now if there is a way we can remove this dependency and it doesn't break existing workflows, that would be even better. I haven't made the time to sit down and play with it, though.
> Now if there is a way we can remove this dependency and it doesn't break existing workflows
I think the straightforward way to do that is to remove pyarrow as a required dependency and error if it's not installed. But a separate question is whether we should be adding new methods that explicitly depend on pyarrow.
I could do a try-except to throw an ImportError if it is not installed. But it might make more sense to drop to_batches() from this PR given the dependency discussion; I don't want to add to the PyArrow dependency concerns. I'll remove it and focus on fixing the interpolate() implementation. I appreciate you taking the time to give feedback!
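A minimal sketch of the try-except approach mentioned above, assuming to_batches() keeps delegating to collect(); the error message wording is illustrative:

```python
def to_batches(self) -> "list[pa.RecordBatch]":
    """Convert the DataFrame to a list of RecordBatches.

    Raises ImportError if the optional pyarrow dependency is missing.
    """
    try:
        import pyarrow as pa  # noqa: F401  (optional dependency)
    except ImportError as exc:
        raise ImportError(
            "to_batches() requires pyarrow; install it with "
            "'pip install pyarrow'"
        ) from exc
    return self.collect()
```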
Thank you for the PR! I understand getting your first PR in a new project can be daunting. I know I have some critical feedback here, but I do appreciate the effort.
def interpolate(self, method: str = "forward_fill", **kwargs) -> DataFrame:
    """Interpolate missing values per column.

    Args:
        method: Interpolation method ('linear', 'forward_fill', 'backward_fill')
At the outset this doesn't look quite right to me.

- The method is called interpolate, but the one actual interpolation method ('linear') is the one not supported. The others are filling operations, not interpolations.
- This looks like it's going to produce a very wrong filling: every field in the schema gets sorted and then filled? I would expect we would need an ordering column to determine the filling method.
- If we were to do this, I think the most pleasant experience would be an enum for the possible values, with the simple string conversions as necessary (see the sketch after this list).
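For illustration, a minimal sketch of the enum idea; the names FillMethod and _resolve_fill_method are hypothetical, not part of the PR:

```python
from enum import Enum


class FillMethod(Enum):
    """Hypothetical enum for the supported fill strategies."""

    FORWARD = "forward_fill"
    BACKWARD = "backward_fill"


def _resolve_fill_method(method: "FillMethod | str") -> FillMethod:
    # Accept either the enum member or its simple string form;
    # FillMethod("forward_fill") raises ValueError on unknown strings.
    return method if isinstance(method, FillMethod) else FillMethod(method)
```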
Two quick questions:
- Should we have separate methods for filling vs interpolation?
- Do we wait for DataFusion core features before adding Python APIs, or is it okay to create stand-in Python implementations using existing primitives?
Thank you for your time!
result = df.interpolate("forward_fill")

assert isinstance(result, DataFrame)
We would probably want to collect the results and verify they fill as expected.
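As an illustration, a sketch of that kind of assertion, assuming a single-batch result with a nullable 'value' column and pyarrow available in the test suite; the column name and expected values are hypothetical:

```python
batches = result.collect()
values = batches[0].column("value").to_pylist()
# With forward fill, each null takes the previous non-null value.
assert values == [1.0, 1.0, 2.0, 2.0]
```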
Thanks for the feedback!
Addresses items from RFC: Re-work some DataFrame APIs #875
This is my first PR! Please let me know what I can improve.